EasyVisa Project

Context:

Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

Objective:

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a Machine Learning based solution that can help in shortlisting the candidates having higher chances of VISA approval. OFLC has hired your firm EasyVisa for data-driven solutions. You as a data scientist have to analyze the data provided and, with the help of a classification model:

Data Description

The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

Importing necessary libraries and data

Data Overview

Observations

  1. no_of_employees and yr_of_estab are numerical (integer) variables.
  2. prevailing_wage is a float variable.
  3. The remaining columns are of object type and should be converted to categorical variables.

Observations

  1. There are 25480 rows and 11 columns

To check for the duplicates

Observations

  1. There are no duplicate rows.

Observations

  1. All columns have 25480 non-null observations, indicating no missing values.
  2. All the object-typed columns should be converted to categorical variables.

Observation

  1. no_of_employees is negative in some rows, which is not a valid employee count.
  2. no_of_employees varies widely, from -26 to 602069.
  3. yr_of_estab ranges widely, from 1800 to 2016.
  4. prevailing_wage is a float variable ranging from 2.137 to 319210.27. The wide range suggests that there are many outliers.

Observation

  1. continent has 6 unique values.
  2. education_of_employee and unit_of_wage each have 4 unique values.
  3. has_job_experience, requires_job_training, full_time_position and case_status each have 2 unique values.
  4. region_of_employment has 5 unique values.

Missing value

Observations

  1. There are no missing values in the data provided.
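
The shape, duplicate and missing-value checks reported above can be reproduced with a few pandas calls. The following is a minimal sketch on a tiny made-up DataFrame standing in for the visa data; the rows and values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Tiny stand-in for the visa data (all values are made up for illustration).
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa"],
    "no_of_employees": [1200, -26, 450, 3100],
    "prevailing_wage": [592.2, 83425.65, 122996.86, 83434.03],
    "case_status": ["Denied", "Certified", "Denied", "Certified"],
})

n_rows, n_cols = df.shape                    # number of rows and columns
n_duplicates = int(df.duplicated().sum())    # count of fully identical rows
missing_per_column = df.isnull().sum()       # missing values in each column

print(f"{n_rows} rows, {n_cols} columns, {n_duplicates} duplicate rows")
print(missing_per_column)
```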

Observations

  1. About 67% of the cases are certified.
  2. Most of the employees are paid yearly wages.
  3. More than 80% of the positions are full-time.

Check for unique values in the column

Observation

  1. The mean of no_of_employees is greater than its median (the 50% value), so the distribution is positively skewed.
  2. yr_of_estab varies between 1800 and 2016.
  3. prevailing_wage ranges from 2.137 to 319210.27.

Observation

  1. There are no missing values.

Convert the object variables to Categorical Variables.

Observation

The object variables are converted to categorical variables.
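
One way to perform this conversion is sketched below. It treats every non-numeric column as a candidate for the category dtype; the small DataFrame is a made-up stand-in for the real data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up sample; only the column names mirror the dataset.
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia"],
    "unit_of_wage": ["Year", "Hour", "Year"],
    "no_of_employees": [1200, 26, 450],
})

# Convert every non-numeric column to the pandas category dtype.
text_cols = [col for col in df.columns if not is_numeric_dtype(df[col])]
for col in text_cols:
    df[col] = df[col].astype("category")

print(df.dtypes)
```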

Exploratory Data Analysis (EDA)

Questions:

  1. Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?

  2. How does the visa status vary across different continents?

  3. Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?

  4. In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?

  5. The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?

Exploratory Data Analysis

Let us explore the numerical variables first.

Observations on no_of_employees

Observations

  1. The distribution is positively skewed.
  2. There are many outliers.
  3. Most companies have fewer than 20,000 employees, although a few have around 600,000.

Observations on yr_of_estab

Observations

  1. The distribution is negatively skewed.
  2. Most of the companies were established between 1980 and 2000.
  3. There are many outliers.

Observations on prevailing_wage

Observations

  1. The graph is positively skewed.
  2. There are many outliers.

Let's explore the categorical variables.

Observations on continent

Observation

  1. Most of the employees are from Asia.
  2. Very few employees are from Oceania.

Observations on education_of_employee

Observation

  1. 40.2% of the employees have a Bachelor's degree and 37.8% have a Master's degree.
  2. Very few employees (8.6%) have a Doctorate.

Observation on has_job_experience

Observation

  1. 58.1% of the applicants have job experience.

Observation on requires_job_training

Observation

  1. 88.4% of the applicants do not require job training.

Observations on region_of_employment

Observation

  1. Very few (1.5%) of the applications are for the Island region.
  2. Most of the applications are for the Northeast, South and West regions.

Observations on unit_of_wage

Observation

  1. 90.1% of the employees are paid yearly and 8.5% are paid hourly.
  2. 1.1% of the employees are paid weekly and 0.3% are paid monthly.

Observations on full_time_position

Observation

  1. 89.4% of the positions are full-time; the rest are not.

Observations on case_status

Observation

  1. 66.8% of the cases are Certified.

Bivariate Analysis

Plot the pairwise relationships to understand how the variables interact with each other

Observation

  1. The highest correlation in the heatmap above is between yr_of_estab and no_of_employees (0.018), which is very weak.
  2. It is important to note that correlation does not imply causation.
  3. The lowest correlation is between no_of_employees and prevailing_wage.

Bivariate Scatter Plots

Observation

  1. The highest correlation in the heatmap above is between yr_of_estab and no_of_employees (0.018), which is very weak.
  2. It is important to note that correlation does not imply causation.
  3. The lowest correlation is between no_of_employees and prevailing_wage.

Check continent by case_status

Observation

  1. Most of the applicants whose case status is Certified are from Asia.
  2. Most of the applicants whose case status is Denied are also from Asia.
  3. Applicants from Oceania have the fewest rejections.

Check education_of_employee count by case_status

Observation

  1. Most of the certified applicants have a Master's degree, followed by those with a Bachelor's degree.
  2. There are very few rejections among Doctorate candidates.

Check has_job_experience by case_status

Observations

  1. Most of the applicants whose case_status is Certified have job experience.
  2. There are more rejections among applicants without job experience.

Check requires_job_training by case_status

Observations

  1. Most of the candidates whose case is certified did not require job training.
  2. Very few rejections are among those who required job training.

Check region_of_employment by case_status

Observation

  1. Most of the certified applications are from the South region, followed by the Northeast and West.
  2. Very few rejected applications are from the Island region.

Check unit_of_wage by case_status

Observation

  1. Most of the employees receive yearly wages.
  2. Yearly-waged applicants account for the largest number of certified cases.
  3. Monthly- and weekly-waged applicants have the fewest rejections.

Check full_time_position by case_status

Observation

  1. Most of the certified applicants have full-time positions.
  2. Very few rejected candidates have full-time positions.

no_of_employees Vs case_status

Observation

  1. There are more certified applicants than rejected ones.

prevailing_wage Vs case_status

Observation

  1. Certified applicants have higher prevailing wages than rejected applicants.

Relationship between no_of_employees and yr_of_estab

Observation

  1. Companies established between 1980 and 2000 have the highest employee counts.
  2. Companies established in 1800 have very few employees.

Relationship between yr_of_estab and prevailing_wage

Observation

  1. Prevailing wages are highest among companies established between 1975 and 2000.
  2. Prevailing wages are lower among companies established in the 1800s.

Relationship between no_of_employees and prevailing_wage

Observation

  1. Most of the employees have a prevailing_wage below 150000.
  2. Very few employees have a higher prevailing_wage.

Correlation between no_of_employees, yr_of_estab and case_status

Observation

  1. Most of the certified cases come from companies with fewer than 200000 employees that were established between 1950 and 2000.

Correlation between yr_of_estab, prevailing_wage and case_status

Observation

  1. Most of the certified cases are for employees of companies established between 1950 and 2000 with a prevailing_wage below 150000.

Correlation between no_of_employees, prevailing_wage and case_status

Observation

  1. Most of the certified cases are for employees of companies established between 1950 and 2000 with a prevailing_wage below 150000.

Correlation between no_of_employees, continent and case_status

Observation

  1. Rejections are fewer among applicants from Africa and Oceania who work for companies with a low employee count.
  2. Most of the certified cases are among applicants from Africa, followed by North America and Europe.

Correlation between no_of_employees, education_of_employee and case_status

Observation

  1. Most of the certified cases are applicants with a Doctorate who work for companies with a higher employee count.
  2. More rejections are among the applicants who have a Doctorate.

Correlation between no_of_employees, has_job_experience and case_status

Observation

  1. A similar number of cases are certified among applicants with and without job experience, so in this plot case_status depends only weakly on has_job_experience.

Correlation between no_of_employees, requires_job_training and case_status

Observation

  1. Rejections do not depend on whether the candidate requires job training.
  2. More cases are certified among candidates who do not require job training.

Correlation between no_of_employees, region_of_employment and case_status

Observation

  1. More cases are certified among applicants in the Island region who work for companies with a higher employee count.
  2. Applicants in the Midwest have the fewest rejections. Applicants in the Island region have the most certified cases.

Correlation between no_of_employees, unit_of_wage and case_status

Observations

  1. More applicants are certified among those who receive weekly wages and work for companies with a high employee count.
  2. More applicants are rejected among those who receive yearly wages.

Correlation between no_of_employees, full_time_position and case_status

Observation

  1. Fewer rejections are seen among employees in full-time positions at companies with a higher employee count.
  2. More cases are certified among employees who are not in full-time positions.

Observation

  1. Most of the certified cases are among employees of companies established between 1950 and 2000.
  2. Very few cases are certified among employees of companies established in the 1800s.

Observation

There are more certified applicants than rejected ones.

1. Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?

Observation

  1. In the given dataset, most applicants have a Bachelor's degree, followed by a Master's degree.
  2. More cases are certified among employees with a Master's degree, followed by those with a Bachelor's degree.
  3. Employees with a Doctorate have the fewest rejections.

2. How does the visa status vary across different continents?

Observation

  1. The most cases are certified among employees from Asia.
  2. The fewest cases are certified among employees from Oceania.
  3. Fewer rejections are made among employees from Africa, followed by South America.

Observation

  1. Rejections are fewer among employees from Africa and Oceania.
  2. Most of the certified cases are among applicants from Africa, followed by North America and Europe.

3. Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?

Observation

  1. In the given dataset, most employees have job experience.
  2. More cases are certified among employees who have job experience.
  3. More rejections are among employees who do not have job experience.

Observation

A similar number of cases are certified among applicants with and without job experience, so in this plot case_status depends only weakly on has_job_experience.

4. In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?

Observation

  1. Most of the employees receive yearly wages.
  2. Yearly-waged applicants account for the largest number of certified cases.
  3. Monthly- and weekly-waged applicants have the fewest rejections.

5. The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?

Observation

Certified applicants have higher prevailing wages than rejected applicants.

Insights

  1. Most of the certified cases, and also most of the rejected cases, are among applicants from Asia. Oceania has the fewest rejections.
  2. There are more certified applicants from Africa, followed by North America and Europe.
  3. 40.2% of the employees have a Bachelor's degree and 37.8% have a Master's degree. Very few employees (8.6%) have a Doctorate.
  4. Most of the certified applicants have a Master's degree, followed by those with a Bachelor's degree. There are very few rejections among Doctorate candidates.
  5. 58.1% of the applicants have job experience. case_status depends only weakly on has_job_experience.
  6. 88.4% of the applicants do not require job training. Most of the candidates whose case is certified did not require job training.
  7. Most of the applications are for the Northeast, South and West regions. A small share (1.5%) are for the Island region. Most of the certified applications are from the South region, followed by the Northeast and West. Very few rejected applications are from the Island region. The Midwest has the fewest rejections.
  8. 90.1% of the employees are paid yearly, 8.5% hourly, 1.1% weekly and 0.3% monthly. Yearly-waged applicants account for the largest number of certified cases. Monthly- and weekly-waged applicants have the fewest rejections.
  9. 89.4% of the positions are full-time. Most of the certified applicants have full-time positions.
  10. 66.8% of the cases are certified. Most of the certified cases are among employees of companies established between 1950 and 2000.
  11. Certified applicants have higher prevailing wages than rejected applicants. Prevailing wages are highest among companies established between 1975 and 2000, and most employees have a prevailing_wage below 150000.

Data Preprocessing

Missing Value Treatment

Check for missing value

Observation

There are no missing values. Hence, there is no need for missing value treatment.

Duplicate Rows Check

Observations

There are no duplicate rows.

Feature Engineering

Convert the negative values to positive using the abs function.

The no_of_employees column contains negative values, but an employee count cannot be negative. Hence, these values are converted to positive using the abs function.
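
A minimal sketch of this step, using a made-up column of employee counts:

```python
import pandas as pd

# Made-up employee counts, two of them negative.
df = pd.DataFrame({"no_of_employees": [1200, -26, 450, -11]})

# Count the invalid rows before fixing them.
n_negative = int((df["no_of_employees"] < 0).sum())

# Replace each value with its absolute value.
df["no_of_employees"] = df["no_of_employees"].abs()

print(n_negative, df["no_of_employees"].tolist())
```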

Observation

  1. Negative values are converted to positive values.

The prevailing wages are expressed in the unit given by unit_of_wage, so converting them to a single unit makes comparison easier. The unit_of_wage column itself is not altered.

Observation

All the prevailing wages are converted to one unit. The unit_of_wage column is kept unchanged.
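
The notebook text does not say which unit was chosen or which multipliers were used. The sketch below assumes a yearly unit with 2080 working hours, 52 weeks and 12 months per year; both the multipliers and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Made-up wages, one per pay unit.
df = pd.DataFrame({
    "prevailing_wage": [25.0, 1000.0, 5000.0, 80000.0],
    "unit_of_wage": ["Hour", "Week", "Month", "Year"],
})

# Assumed number of pay periods per year for each unit (not from the source).
PERIODS_PER_YEAR = {"Hour": 2080, "Week": 52, "Month": 12, "Year": 1}

# Scale each wage to a yearly figure; unit_of_wage itself is left untouched.
multiplier = df["unit_of_wage"].map(PERIODS_PER_YEAR).astype(float)
df["prevailing_wage"] = df["prevailing_wage"] * multiplier

print(df)
```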

Outlier Treatment

Observation

  1. There are many outliers in no_of_employees, yr_of_estab and prevailing_wage. All of them need to be treated here.

Observation

  1. Outlier treatment is done.
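
The text does not state which outlier treatment was applied. A common choice is to cap values at the IQR whiskers, sketched below; the function name cap_outliers_iqr, the factor 1.5 and the sample values are assumptions, not the notebook's confirmed method.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    series = series.astype(float)
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up employee counts with one extreme value.
df = pd.DataFrame({"no_of_employees": [10, 20, 30, 40, 50, 60, 70, 80, 90, 602069]})
df["no_of_employees"] = cap_outliers_iqr(df["no_of_employees"])

print(df["no_of_employees"].max())
```

Capping keeps every row in the data, which matters here because each row is an application that still needs a prediction.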

Data Preparation for Modeling

Converting the categorical variables to integers.

Observation

  1. continent, education_of_employee, region_of_employment and unit_of_wage are categorical variables.
  2. prevailing_wage is a float variable.
  3. has_job_experience, requires_job_training, no_of_employees, full_time_position and case_status are integer variables.
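
A hedged sketch of how such an encoding could look: two-level columns are mapped to 0/1 and multi-level columns are one-hot encoded. The "Y"/"N" labels and the sample rows are assumptions; "Certified" and "Denied" are the status labels mentioned in the observations.

```python
import pandas as pd

# Made-up rows; the Y/N labels are assumed for illustration.
df = pd.DataFrame({
    "education_of_employee": ["High School", "Master's", "Bachelor's", "Doctorate"],
    "has_job_experience": ["Y", "N", "Y", "Y"],
    "prevailing_wage": [30000.0, 85000.0, 60000.0, 120000.0],
    "case_status": ["Denied", "Certified", "Certified", "Certified"],
})

# Two-level columns become 0/1 integers.
df["has_job_experience"] = df["has_job_experience"].map({"Y": 1, "N": 0})
df["case_status"] = df["case_status"].map({"Certified": 1, "Denied": 0})

# Multi-level columns are one-hot encoded, dropping the first level.
X = pd.get_dummies(df.drop(columns="case_status"),
                   columns=["education_of_employee"], drop_first=True)
y = df["case_status"]

print(X.columns.tolist())
```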

EDA

Observations on no_of_employees

Observation

  1. The distribution is positively skewed.
  2. The number of employees is below 2000 in most cases.
  3. Outliers are treated.

Observations.

  1. Most of the companies were established between 2000 and 2020.
  2. The distribution is negatively skewed.
  3. The outliers are treated.

Observations on continent

Observation

Most of the employees are from Asia. Very few employees are from Oceania.

Observations on education_of_employee

Observation

40.2% of the employees have a Bachelor's degree and 37.8% have a Master's degree. Very few employees (8.6%) have a Doctorate.

Observation on has_job_experience

Observation

58.1% of the applicants have job experience.

Observation on requires_job_training

Observation

88.4% of the applicants do not require job training.

Observations on region_of_employment

Observation.

Very few (1.5%) of the applications are for the Island region. Most of the applications are for the Northeast, South and West regions.

Observations on unit_of_wage

Observation

90.1% of the employees are paid yearly, 8.5% hourly, 1.1% weekly and 0.3% monthly.

Observations on full_time_position

Observation

89.4% of the positions are full-time; the rest are not.

Observations on case_status

Observation

66.8% of the cases are Certified.

Bivariate Analysis

Plot the pairwise relationships to understand how the variables interact with each other

Observation

  1. The highest correlation in the heatmap above is between case_status and has_job_experience, followed by requires_job_training and full_time_position.
  2. It is important to note that correlation does not imply causation.
  3. Some pairs of variables are negatively correlated.

Bivariate Scatter Plots

Observation

The highest correlation in the heatmap above is between case_status and has_job_experience, followed by requires_job_training and full_time_position. It is important to note that correlation does not imply causation. Some pairs of variables are negatively correlated.

Check continent by case_status

Observation

Most of the applicants whose case status is Certified are from Asia. Most of the applicants whose case status is Denied are also from Asia. Applicants from Oceania have the fewest rejections.

Check education_of_employee count by case_status

Observation

Most of the certified applicants have a Master's degree, followed by those with a Bachelor's degree. There are very few rejections among Doctorate candidates.

Check has_job_experience by case_status

Observations

Most of the applicants whose case_status is Certified have job experience. There are more rejections among applicants without job experience.

Check requires_job_training by case_status

Observations

Most of the candidates whose case is certified did not require job training. Very few rejections are among those who required job training.

Check region_of_employment by case_status

Observation

Most of the certified applications are from the South region, followed by the Northeast and West. Very few rejected applications are from the Island region.

Check unit_of_wage by case_status

Observation

Most of the employees receive yearly wages. Yearly-waged applicants account for the largest number of certified cases. Monthly- and weekly-waged applicants have the fewest rejections.

Check full_time_position by case_status

Observation

Most of the certified applicants have full-time positions. Very few rejected candidates have full-time positions.

Relationship between no_of_employees and yr_of_estab

Observation

Companies established between 2000 and 2020 have a higher employee count.

Relationship between yr_of_estab and prevailing_wage

Observation

Prevailing wages are highest among companies established between 1975 and 2000.

Relationship between no_of_employees and prevailing_wage

Observation

  1. Prevailing wages are higher for employees of companies with fewer than 5000 employees.

Correlation between no_of_employees, yr_of_estab and case_status

Observation

  1. Most cases with case_status = 1 are from companies established between 1980 and 2020 with fewer than 5000 employees.

Correlation between yr_of_estab, prevailing_wage and case_status

Observation

  1. Acceptance is more likely when prevailing_wage is less than 15000 and the company was established between 1980 and 2020.

Correlation between no_of_employees, prevailing_wage and case_status

Observation

  1. Acceptance is more likely when the company's employee count is less than 5000 and prevailing_wage is less than 150000.

Correlation between no_of_employees, continent and case_status

Observation

After outlier treatment, the graph changes. Employees from Africa who work for companies with a higher employee count have the highest acceptance. Employees from Oceania who work for companies with a lower employee count have the highest rejections.

1. Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?

Observation

In the given dataset, most applicants have a Bachelor's degree, followed by a Master's degree. More cases are certified among employees with a Master's degree, followed by those with a Bachelor's degree. Employees with a Doctorate have the fewest rejections.

2. How does the visa status vary across different continents?

Observation

The most cases are certified among employees from Asia. The fewest cases are certified among employees from Oceania. Fewer rejections are made among employees from Africa, followed by South America.

Observation

  1. After outlier treatment, the graph changes.
  2. Employees from Africa who work for companies with a higher employee count have the highest acceptance.
  3. Employees from Oceania who work for companies with a lower employee count have the highest rejections.

3. Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?

Observation

In the given dataset, most employees have job experience. More cases are certified among employees who have job experience. More rejections are among employees who do not have job experience.

Observation

Applicants who have job experience and work for companies with a higher employee count have higher chances of being accepted.

4. In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?

Observation

Most of the employees receive yearly wages. Yearly-waged applicants account for the largest number of certified cases. Monthly- and weekly-waged applicants have the fewest rejections.

5. The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?

Observation

Certified applicants have higher prevailing wages than rejected applicants.

Checking Multicollinearity

Building the model

Model Evaluation Criterion

Model can make wrong predictions as:
  1. Predicting that a case will be certified when in reality the OFLC rejects it.
  2. Predicting that a case will be rejected, so that the case never goes to the OFLC.
Which case is more important?

Both the cases are important.

  1. Predicting that a case will be certified when in reality the OFLC rejects it might result in a large number of denied applications, a significant loss of time and a resource crunch.
  2. Predicting that a case will be rejected means the case never goes to the OFLC. This results in losing a good candidate. Since there are many applications, losing some might not matter much.

How to reduce this loss, i.e. the need to reduce False Negatives?

  1. Recall is the ratio of true positives to actual positives, TP / (TP + FN). High recall therefore means fewer false negatives, that is, a lower chance of predicting a certified case as rejected.
  2. Recall should be maximized: the greater the recall, the higher the chances of minimizing false negatives.

First, let's create functions to calculate different metrics and the confusion matrix so that we don't have to use the same code repeatedly for each model.

  1. The make_confusion_matrix_for_model function will be used to plot the confusion matrix.
  2. The get_metrics_score function is used to calculate the metrics.
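
The bodies of these two helpers are not shown in the text, so the following is only a sketch of what they might look like. The signatures are assumptions, the confusion matrix is returned as a labelled table rather than plotted, and the data is synthetic (class 1 stands for certified).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def get_metrics_score(model, X_train, X_test, y_train, y_test):
    """Return accuracy, recall and precision on train and test (assumed signature)."""
    scores = {}
    for name, X, y in (("train", X_train, y_train), ("test", X_test, y_test)):
        pred = model.predict(X)
        scores[f"{name}_accuracy"] = accuracy_score(y, pred)
        scores[f"{name}_recall"] = recall_score(y, pred)
        scores[f"{name}_precision"] = precision_score(y, pred)
    return scores

def make_confusion_matrix_for_model(model, X, y):
    """Return the confusion matrix as a labelled table (assumed signature)."""
    cm = confusion_matrix(y, model.predict(X), labels=[0, 1])
    return pd.DataFrame(cm, index=["Actual 0", "Actual 1"],
                        columns=["Predicted 0", "Predicted 1"])

# Synthetic, imbalanced data standing in for the prepared visa features.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)
scores = get_metrics_score(model, X_train, X_test, y_train, y_test)
cm = make_confusion_matrix_for_model(model, X_test, y_test)
print(scores)
print(cm)
```

Test recall can be read off the table as the bottom-right cell divided by the sum of the bottom row.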

Building a Decision Tree model

  1. We will build our model using DecisionTreeClassifier.
  2. The default 'gini' criterion is used to split. Other options include 'entropy'.
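
As a rough illustration on synthetic data (not the visa dataset), a default tree with the gini criterion can be fitted as follows. An unrestricted tree typically fits the training set perfectly, which is the overfitting pattern discussed later.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Fully grown tree using the default gini criterion.
dtree = DecisionTreeClassifier(criterion="gini", random_state=1)
dtree.fit(X_train, y_train)

train_recall = recall_score(y_train, dtree.predict(X_train))
test_recall = recall_score(y_test, dtree.predict(X_test))
print(f"train recall={train_recall:.3f}, test recall={test_recall:.3f}")
```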

Confusion matrix

Visualize the dTree

As per the decision tree model, no_of_employees is the most important variable for prediction.

The above tree is very complex; such a tree usually overfits.

Reduce the OverFit

The deeper the tree, the more complex the model: it has more splits and captures more information from the training data, which is one of the causes of overfitting. Now, let's try limiting the depth of the tree to 3.
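
The effect of the depth limit can be seen by comparing the size of a fully grown tree with one restricted to max_depth=3. This is a sketch on synthetic data, so the exact sizes will differ from the notebook's tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

deep = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

# A depth-3 binary tree can have at most 2**3 = 8 leaves.
print("deep:   ", deep.get_depth(), "levels,", deep.get_n_leaves(), "leaves")
print("shallow:", shallow.get_depth(), "levels,", shallow.get_n_leaves(), "leaves")
```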

Do we need to prune the tree?

Confusion Matrix - dTree with depth limited to 3

Let's visualize the decision tree

Observation

  1. The most important feature changes to education_of_employee_HighSchool.
  2. Recall is 0.927 on the training set and 0.929 on the test set, which shows that the model generalizes better. However, accuracy is only 0.72.
  3. Let's see if we can improve further.

Reducing the OverFit

Using GridSearch for Hyperparameter tuning to the tree model
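
A minimal sketch of such a search with scikit-learn's GridSearchCV, scored on recall. The parameter grid below is an assumption chosen for illustration, not the grid used in the notebook, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Illustrative grid (assumed values).
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                    scoring="recall", cv=5)
grid.fit(X_train, y_train)

best_tree = grid.best_estimator_   # refitted on the full training set
print(grid.best_params_, round(grid.best_score_, 3))
```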

Bagging and Boosting Models

Building the model

  1. Let's build 2 ensemble models here: Bagging Classifier and Random Forest Classifier.
  2. Let's build these models with default parameters first and then use hyperparameter tuning to optimize the model performance.
  3. The metric of interest here is recall.
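
The two default models can be fitted and compared on recall as sketched below, again on synthetic data rather than the visa dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

models = {
    "bagging": BaggingClassifier(random_state=1),
    "random_forest": RandomForestClassifier(random_state=1),
}

# Fit each model with default parameters and record train/test recall.
recalls = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    recalls[name] = (recall_score(y_train, model.predict(X_train)),
                     recall_score(y_test, model.predict(X_test)))

for name, (train_r, test_r) in recalls.items():
    print(f"{name}: train recall={train_r:.3f}, test recall={test_r:.3f}")
```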

Bagging Classifier

Observation

  1. Recall is 0.98 on the training data and 0.77 on the testing data, which indicates overfitting. Let's see if there is any scope for further improvement upon tuning.

Random Forest Classifier

Observation

  1. The Random Forest model seems to overfit the training data; its recall is 0.83 on the testing data.
  2. Let's see if there is any scope for improvement.

Will tuning the hyperparameters improve the model performance?

Hyperparameter Tuning

Bagging Classifier

Some of the hyperparameters used here are max_features, n_estimators and max_samples.
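
A hedged sketch of tuning these three hyperparameters with GridSearchCV. The candidate values are assumptions for illustration and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Illustrative candidate values (assumed).
param_grid = {
    "max_samples": [0.7, 1.0],    # fraction of rows drawn for each estimator
    "max_features": [0.7, 1.0],   # fraction of columns drawn for each estimator
    "n_estimators": [10, 30],     # number of base estimators
}

grid = GridSearchCV(BaggingClassifier(random_state=1), param_grid,
                    scoring="recall", cv=3)
grid.fit(X_train, y_train)
bagging_tuned = grid.best_estimator_
print(grid.best_params_)
```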

Confusion Matrix

Observation

After hyperparameter tuning, the recall on both the training and testing data reached 1.0.

Logistic regression as the base estimator for bagging classifier:

Now, let's try using logistic regression as the base estimator.
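
One way to set this up is sketched below. The base estimator is passed positionally because its keyword name differs between scikit-learn releases (base_estimator in older ones, estimator in newer ones). The n_estimators and max_iter values and the data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Bagging over logistic regression models instead of decision trees.
bagging_lr = BaggingClassifier(LogisticRegression(max_iter=1000),
                               n_estimators=20, random_state=1)
bagging_lr.fit(X_train, y_train)

test_recall = recall_score(y_test, bagging_lr.predict(X_test))
print(f"test recall={test_recall:.3f}")
```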

Observation

  1. Recall is 1.0 on both training and testing data.

Random Forest Classifier

Now, let's see if we can get a better model by tuning the random forest classifier. Some of the important hyperparameters available for random forest classifier are min_samples_leaf and max_features.

Observation

  1. Random forest's performance has improved compared to the random forest model with default parameters.
  2. The test recall is slightly lower than the training recall.

Let's try using class_weight for random forest:
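
The weights used in the notebook are not given. The sketch below uses scikit-learn's class_weight="balanced" option, which weights each class inversely to its frequency; this choice and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in (about two thirds in class 1).
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# "balanced" gives the minority class a proportionally larger weight.
rf_weighted = RandomForestClassifier(class_weight="balanced",
                                     n_estimators=100, random_state=1)
rf_weighted.fit(X_train, y_train)

test_recall = recall_score(y_test, rf_weighted.predict(X_test))
print(f"test recall={test_recall:.3f}")
```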

The model performance is not very good. This may be because the classes are imbalanced, with 67% certified and 33% denied.

Checking the important features

Observation

  1. The most important feature is education_of_employee_highschool.
  2. The model seems to be better after including class weights. However, the test recall is still slightly lower than the training recall.

Boosting Model

Building the model

  1. Let's build 3 ensemble models here: AdaBoost Classifier, Gradient Boosting Classifier and XGBoost Classifier.
  2. Let's build these models with default parameters first and then use hyperparameter tuning to optimize the model performance.
  3. The metric of interest here is recall.
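
The two scikit-learn boosting models can be fitted with default parameters as sketched below on synthetic data. XGBoost lives in the separate xgboost package and is left out of this sketch to keep it dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

boosters = {
    "adaboost": AdaBoostClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

# Fit each booster with default parameters and record test recall.
boost_recall = {}
for name, model in boosters.items():
    model.fit(X_train, y_train)
    boost_recall[name] = recall_score(y_test, model.predict(X_test))

print(boost_recall)
```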

AdaBoost Classifier

Gradient Boosting Classifier

XGBoost Classifier

Observation

  1. AdaBoost and Gradient Boost models have better recall compared to XGBoost.

Will tuning the hyperparameters improve the model performance?

Hyperparameter Tuning

AdaBoost Classifier

Hyperparameters used here are base_estimator, n_estimators and learning_rate.

  1. education_of_employee_highschool is the most important feature as per the tuned AdaBoost model.

Gradient Boosting Classifier

Let's try using AdaBoost classifier as the estimator for initial predictions
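
In scikit-learn this corresponds to the init parameter of GradientBoostingClassifier, which accepts an estimator that supplies the initial predictions. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# AdaBoost provides the starting predictions that gradient boosting then refines.
gb_init = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                                     random_state=1)
gb_init.fit(X_train, y_train)

test_recall = recall_score(y_test, gb_init.predict(X_test))
print(f"test recall={test_recall:.3f}")
```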

Gradient Boosting Tuned

Compared to the default parameters, the metrics seem to be the same.

Observation

  1. The recall value has increased compared to the previous models.
  2. education_of_employee_HighSchool is the most important feature.

XGBoost Classifier

Observation

  1. The most important feature is education_of_employee_High School.
  2. Though recall is 1.0, the accuracy is very low.

Model Improvement Boosting

Stacking Model

Observations

  1. All the metrics obtained have reasonably good values.
  2. Recall is lower than the value obtained in the previous model. Let's compare all the models.
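
The text does not list which base models were stacked. The sketch below assumes a random forest and a gradient boosting model as base estimators with logistic regression as the final estimator; all three choices and the data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.33, 0.67], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Base models (assumed); their cross-validated predictions feed the final model.
base_models = [
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=1)),
    ("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]
stacking = StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression(max_iter=1000))
stacking.fit(X_train, y_train)

test_recall = recall_score(y_test, stacking.predict(X_test))
print(f"test recall={test_recall:.3f}")
```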

Model Performance Comparison and Conclusions

Observation

  1. By definition, high recall means fewer false negatives, that is, a lower chance of predicting a certified case as rejected. Recall should be maximized: the greater the recall, the higher the chances of identifying the cases that will be certified.
  2. Here, the highest recall has been obtained for Decision Tree Hypertuned, bagging_logistic_regression, bagging_estimator_tuned and XGBoost Tuned. But the accuracy of these models is very low (0.67). This matches the share of certified cases in the data, which suggests these models predict almost every case as certified.
  3. The second-best models, combining good recall with better accuracy, are Random Forest Classifier -weighted and Gradient Boosting Tuned.
  4. The difference between the recall values of the Random Forest Classifier -weighted and Gradient Boosting Tuned models is 2%. The recall value of Random Forest Classifier -weighted is slightly higher.
  5. The accuracy of the Gradient Boosting Tuned model is 1% higher than that of Random Forest Classifier -weighted.
  6. However, the difference between training and testing recall is smaller for the Gradient Boosting Tuned model.
  7. Hence, the Gradient Boosting Tuned model is preferred.

Actionable Insights and Recommendations

From the data analysis, we can see definite patterns.
  1. Most applications and certifications are from Asia.
  2. Close to 80% of the applicants (78%) have a Master's or Bachelor's degree.
  3. People from Europe have a low certification rate.
  4. People from South America, North America and Asia have a high certification rate.
  5. Certification is highest for people with a Doctorate or Master's degree.
  6. Certification is lowest for people with a High School education.
  7. The rejection rate is high among people with no job experience.
  8. All weekly wage workers are certified.
  9. Employees in the Northeast have a higher percentage of certification, while employees in the Midwest have the highest percentage of rejection.

Hence, the education of the candidates (High School / Master's / Doctorate), job experience, prevailing wage, unit of wage and region of employment play a key role in the certification rate.

OFLC can save a lot of resources and time by pre-filtering or pre-sorting the candidates based on the model. They can start by processing the applications where the candidates have a high level of education, job experience and prevailing wage, and are being employed in regions like the Northeast. They could also raise the minimum requirements for prevailing wage, education and job experience so that they get only high-quality candidates who are more likely to be certified. Since the model selected has high recall and a decent accuracy, the model predictions on positive certifications will help reduce resource wastage while keeping the opportunity cost of losing good candidates to a minimum.